04. Types of Sampling

Types Of Sampling

Types of Sampling

While web and other online experiments have an easy time collecting data,
collecting data from traditional methods involving real populations is a much
more difficult proposition. If you need to perform a survey of a population, it
could be unreasonable in both time and money costs to try and collect thoughts
from every single person in the population. This is where sampling comes in.
The goal of sampling is to only take a subset of the population, using the
responses from that subset to make an inference about the whole population.
Here, we'll cover two basic probabilistic techniques that are commonly used.

The simplest of these approaches is simple random sampling. In a simple
random sample, each individual in the population has an equal chance of being
selected. We just randomly make draws from the population until we have the
sample size desired; your sample size depends on the level of uncertainty you
are willing to have about the collected data. Since everyone has an equal
chance of being drawn, we can expect the feature distribution of selected units
to be similar to the distribution of the population as a whole. In addition, a
simple random sample is easy to set up.

However, it is possible that certain groups are underrepresented in a simple
random sample, especially those that make up a low proportion of the population.
If there are certain rarer subgroups of interest, it can be worth adding one
additional step and performing stratified random sampling. In a
stratified random sample, we need to first divide the entire population into
disjoint groups, or strata. That is, each individual must be a part of one
group, and only one group. For example, you could divide people by gender (male,
female, other), or age (e.g. 18-25, 26-35, etc.).

Then, from each group, you take a simple random sample. In a proportional sample, the sample size is proportional to how large the group is in the full
population. For example, if you require 1000 data points, and stratified
individuals of proportion {0.5, 0.3, 0.2}, then you would take 500 people from
the first group, 300 from the second, and 200 from the third. This guarantees a
certain level of knowledge from each subset, and theoretically a more
representative overall inference on the population.

An alternative approach is to take a nonproportional sample from each group.
For example, we could simply sample 500 people from each group. Computing the
overall statistics in this case requires weighting each group separately, but
this extra effort offers a higher understanding of each subgroup in a deeper
investigation.

Side Note: Non-Probabilistic Sampling

As noted at the start, the goal of sampling is to use a subset of the whole
population to make inferences about the full population, so that we didn't need
to record data from everyone. To that end, probabilistic sampling techniques
were described above to try and obtain a sample that was representative of the
whole. However, it's useful to note that there also exist non-probabilistic
sampling techniques that simplify the sampling process, at the risk of harming
the validity of your results. (We'll discuss the term 'validity' later in the
lesson.)

For example, a convenience sample records information from readily available
units. Studies performed in the social sciences at colleges often fall into
this kind of sampling. The people participating in these tasks are often just
college students, rather than representatives of the population at large. When
performing inferences from this type of study, it's important to consider how
well your results might apply to the population at large.

One notable example of a convenience sample resulting in a grave error comes
from the prediction made by magazine "The Literary Digest" on the 1936 U.S.
presidential election. While they predicted a healthy victory by candidate
Alf Landon, the final result ended with a landslide victory by opposing
candidate Franklin D. Roosevelt. This major error is attributed to their
methods capturing a non-representative sample of the population, which included
looking at the results of a mail-in survey from their magazine readers. Since
the mail-ins were voluntary, and the magazine subscribers were already not
well-representative of the general population, focusing on the people who returned
surveys gave a large bias toward Landon.